Why is EDA so important?

Data can be very sneaky! Presumably everyone here has the same intuition: the data below are quite different from each other - right?


Meet the Datasaurus Dozen!

Meet the Datasaurus Dozen!


It turns out that if we only looked at commonly computed summary statistics -> we would think that they all are the same.


While different in appearance, each dataset has the same summary statistics (mean, standard deviation, and Pearson’s correlation) to two decimal places.

While different in appearance, each dataset has the same summary statistics (mean, standard deviation, and Pearson’s correlation) to two decimal places.


You REALLY need to intimately know your data! Here are seven distributions of data, shown as raw data points (or strip-plots), as box-plots, and as violin-plots.

You REALLY need to intimately know your data! Here are seven distributions of data, shown as raw data points (or strip-plots), as box-plots, and as violin-plots.

These fantastic visualizations and datasets are from here.


Learning objectives


  1. Intro to plotting data with ggplot2
  2. Data exploration - what your data “look” like?
    • Histograms and Density plots
    • Box-, violin-, and jitter-plots
  3. Comparing/contrasting data.
    • Correlation
    • Clustering
    • PCA (principle components analysis)

Review previous class concepts

  1. Working directory and the here package
  2. Tidy-ing and summarizing data
    • join
    • summary by group
here()

getwd()

# setwd()

data_genelevel <- read_delim(file = here("2-class2",
                                         "data",
                                         "data_genelevel.tsv"),
                   delim = "\t",
                   escape_double = FALSE,
                   trim_ws = TRUE)

dux_targets <- read_csv(file = here("2-class2",
                                    "data",
                                    "target_genes.csv"
                                    )
                        )

# lib_counts <- data.frame("lib"=colnames(data_genelevel[,-1]),
#                          "counts"=colSums(data_genelevel[,-1])
#                          )


summary(data_genelevel) # let's get a feel for the data - it is in long format



data_genelevel$target <- if_else(condition = data_genelevel$gene_symbol %in% dux_targets$hgnc_symbol,
        true = "target",
        false = "not_target"
          )

table(data_genelevel$target)

data_genelevel %>% group_by(target) %>% summarise_all(funs(mean))
# boo 'gene_symbol'

data_genelevel %>% group_by(target) %>% select(-gene_symbol) %>% summarise_all(funs(mean)) 

long_genelevel <- melt(data_genelevel) # convert from wide-to-long
names(long_genelevel) <- c("gene_symbol","Target","Sample", "Count") # rename columns

head(long_genelevel)
summary_target_sample <- long_genelevel %>% group_by(Target,Sample) %>% select(-gene_symbol) %>% summarise(Mean=mean(Count)) 

summary_target_sample

1. Intro to plotting data with ggplot2

R has many useful plotting functions that come installed (base functions), for example hist(), plot(), and barplot(). I typically use these functions on the fly while I’m coding to perform quick checks. However, they are a limited, a bit clunky, and far from aesthetically appealing for making convincing graphs/plots. A popular alternative is ggplot2. While graphs produced using ggplot2 defaults are (imo) not quite publication qualitym with a few tweaks you can pretty much get there! We will cover the following ggplot topics:


A. The basic syntax of ggplot()

ggplot(): build plots piece by piece The concept of ggplot divides a plot into three different fundamental parts:

plot = data + aesthetics + geometry.

data: a data frame
aesthetics: specify x and y variables, and other features - color, size, shape, etc.
geometry: specify type of plots - histogram, boxplot, line, density, dotplot, bar, etc.

B. Producing different types of plots

+ Histograms and density plots
+ Box-, violin-, and jitter-plots
+ Summary tables

C. Additional considerations (will return to in class #4)

+ Multiple plots
+ Themes
+ Titles, axes, and legends
+ Colors
+ Scales

2. Data exploration - what your data “look” like?

Ok let’s get this show on the road and visualize some data!!!

# histogram
ggplot(data = data_genelevel, aes(x = hour00_rep1)) +
  geom_histogram(bins = 50) +
  theme_few() +
  ggtitle("histogram")


# histogram log10-scale
ggplot(data = data_genelevel, aes(x = log10(hour00_rep1))) +
  geom_histogram(bins = 50) +
  theme_few() +
  ggtitle("histogram log10-scale")

# histogram log10-scale -> my preferred route
ggplot(data = data_genelevel, aes(x = hour00_rep1)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  theme_few() +
  ggtitle("histogram log10-scale")

# histogram log10-scale with pseudocount
ggplot(data = data_genelevel, aes(x = hour00_rep1)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  theme_few() +
  ggtitle("histogram log10-scale with pseudocount")


# density log10-scale with pseudocount
ggplot(data = data_genelevel, aes(x = hour00_rep1 + 1)) +
  geom_density(bins = 50) +
  scale_x_log10() +
  theme_few() +
  ggtitle("density log10-scale with pseudocount")

# scattter plot
ggplot(data = data_genelevel, aes(x = hour00_rep1, y = hour00_rep2)) +
  geom_point() + 
  theme_few() +
  ggtitle("scattter plot")

# scattter plot log10-scale with pseudocount
ggplot(data = data_genelevel, aes(x = hour00_rep1 + 1, y = hour00_rep2 + 1)) +
  geom_point() + 
  scale_x_log10() +
  scale_y_log10() +
  theme_few() +
  ggtitle("scattter plot log10-scale with pseudocount")


# density of all samples log10-scale with pseudocount
ggplot(long_genelevel, aes(x=Count + 1, color = Sample)) + 
  geom_density() +
  scale_x_log10() +
  theme_few() +
  theme(legend.position="right") +
  ggtitle("density samples log10-scale with pseudocount")

# boxplot of all samples log10-scale with pseudocount
ggplot(long_genelevel, aes(y=Count + 1, x = Sample, color = Sample)) + 
  geom_boxplot() +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("boxplot samples log10-scale with pseudocount")

# violinplot of all samples log10-scale with pseudocount
ggplot(long_genelevel, aes(y=Count + 1, x = Sample, color = Sample)) + 
  geom_violin() +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("violinplot samples log10-scale with pseudocount")



# jitterplot of Dux4 targets
ggplot(long_genelevel %>% filter(Target=="target"), aes(y=Count + 1, x = Sample, color = Sample)) + 
  geom_jitter(width = 0.25) +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  ggtitle("jitterplot samples log10-scale with pseudocount")

3. Comparing/contrasting data

Let’s ask how similar the samples are to each other

Correlation

The two most common correlation methods are Pearson and Spearman. The Pearson method looks at the linear relationship between two or more variables. The Spearman method looks at the rank relationship between two or more variables. For more info click here

I like the corrplot package for visualization of correlations.

# keep only numeric columns for correlatioin
cor_data <- data_genelevel %>% select_if(is.numeric)


# pearson assumes a linear realationship!!
cor(x = cor_data$hour00_rep1, y = cor_data$hour00_rep2, method = "pearson")
cor(x = cor_data$hour00_rep1, y = log10(cor_data$hour00_rep2 + 1), method = "pearson")
cor(x = cor_data$hour00_rep1, y = log10(cor_data$hour00_rep2 + 1), method = "spearman")

# we can caclulate all pairwise correlation coefficients
# remember the pseudocounts! 
cor(x = log10(cor_data + 1), method = "pearson")

gene_cor_pearson <- cor(x = log10(cor_data + 1), method = "pearson")

gene_cor_spearman <- cor(x = cor_data, method = "spearman")



corrplot(corr = gene_cor_pearson,
         addCoefasPercent = T,
         addCoef.col = "white",
         number.cex = .8,
         diag = T
         )

corrplot(corr = gene_cor_spearman,
         addCoefasPercent = T,
         addCoef.col = "white",
         number.cex = .8,
         diag = T
         )


corrplot(corr = gene_cor_pearson,
         addCoefasPercent = T,
         addCoef.col = "white",
         number.cex = .8,
         diag = T,
         order = "hclust",
         addrect = 4,
         rect.col = "red",
         title = "Pearson"
         )

corrplot(corr = gene_cor_spearman,
         addCoefasPercent = T,
         addCoef.col = "white",
         number.cex = .8,
         diag = T,
         order = "hclust",
         addrect = 4,
         rect.col = "red",
         title = "Spearman"
         )



corrplot(corr = gene_cor_pearson,
         type = "upper",
         col = rev(viridis(20)),
         addCoefasPercent = T,
         addCoef.col = "white",
         number.cex = .8,
         diag = F,
         title = "Pearson Upper Triangular"
         )

Clustering

We’ve already played around with clustering in the correlation plot. However, it is important and so lets do it from “scratch”. Two common types of clustering are K-means and hierarchichal.

For an explanation of K-means clustering siee this video

For an explanation of hierarchical clustering siee this video

You can cluster any matrix. Today we will perform hierarchical clustering of the matrix produced by the pearson correlation.

pheatmap(mat = gene_cor_pearson,
         border_color = "black",
         cluster_rows = T,
         cluster_cols = T )

pheatmap(mat = gene_cor_pearson,
         clustering_method = "ward.D2", #specify clustering method
         border_color = "black",
         cluster_rows = T,
         cluster_cols = T )


heatmap_plot <- pheatmap(mat = gene_cor_pearson,
                         clustering_method = "ward.D2", #specify clustering method
                         clustering_distance_rows = "euclidean",
                           clustering_distance_columns = "euclidean",
                         border_color = "black",
                         cluster_rows = T,
                         cluster_cols = T )


plot(heatmap_plot$tree_row)

# keep only DUX4 targets
cor_data_targets <- data_genelevel %>% filter(target=="target") 

# take a look at column names
colnames(cor_data_targets)

# only keep the counts and convert to matrix
cor_data_targets_select <- as.matrix(cor_data_targets[,2:13])


#add rownames as gene symbols for easier clustering visualization
rownames(cor_data_targets_select) <- cor_data_targets$gene_symbol

# do all targets behave the same?
heatmap_cluster_targets <- pheatmap(mat = log10(cor_data_targets_select + 1),
                         clustering_method = "ward.D2", #specify clustering method
                         clustering_distance_rows = "euclidean",
                         border_color = "black",
                         cluster_rows = T,
                         cluster_cols = F # no need to cluster columns
                         )


plot(heatmap_cluster_targets$tree_row)

Principle components analysis (PCA)

PCA is a common dimensionality reduction method that is used to visualize the similarity and differences in your data. Please watch this fantastic 5 minute video explaining PCA

Statquest really is the best! Anyway, for more detailed explanations go here and here.

We will use the prcomp() function to perform PCA analysis. Please note that you should typically center and scale your data.

pca_data <- prcomp(log10(cor_data + 1), center = T, scale. = T) 

summary(pca_data) # summarize the PCs by variance
pca_data_info <- summary(pca_data) # assing the summary to the variable 'pca_data_info'
pca_data_info$importance[2,1] # then we can pull specific numbers out of 'pca_data_info'

Let’s say we want ask how different the samples are to each other using the variance accross all genes.

pca_plot_data <- data.frame(pca_data$rotation) # we make a dataframe out of the rotations and will use this to plot


ggplot(pca_plot_data, aes(x=PC2, y=PC1)) +
  geom_point(size=2) +
  geom_text_repel(label=rownames(pca_plot_data)) +
  theme_few() +
  labs(ylab(paste("PC1 (%",100*round(pca_data_info$importance[2,1], digits = 3),")", sep = "")
)) +
  labs(xlab(paste("PC2 (%",100*round(pca_data_info$importance[2,2], digits = 3),")", sep = "")
)) +
  ggtitle("PCA analysis") 


pca_plot_data$ID <- rownames(pca_plot_data)
pca_plot_data <- pca_plot_data %>% separate(col = ID, sep = "_", into = c("time","rep"))


ggplot(pca_plot_data, aes(x=PC2, y=PC1)) +
  geom_point(size=2, aes(shape = rep, color = time)) +
  geom_text_repel(label=rownames(pca_plot_data)) +
  scale_color_manual(values = rev(viridis(4, option="D", end = .8))) +
  theme_few() +
  labs(ylab(paste("PC1 (%",100*round(pca_data_info$importance[2,1], digits = 3),")", sep = "")
)) +
  labs(xlab(paste("PC2 (%",100*round(pca_data_info$importance[2,2], digits = 3),")", sep = "")
)) +
  ggtitle("PCA analysis") 

Actual biological questions

Is DUX4 induced? What happens to DUX4 target genes with DUX4 induction over time? Do they go up/down/all around? Do they do anything special?

head(long_genelevel)

ggplot(data = long_genelevel  %>% filter(gene_symbol=="DUXA"), aes(x = Sample, y = Count + 1)) +
  geom_point() +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("log10 DUXA counts")

ggplot(data = long_genelevel, aes(x = Sample, y = Count +1, fill = Target)) +
  geom_boxplot() +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("DUX4 targets vs not targets")


ggplot(data = long_genelevel, aes(x = Sample, y = Count +1, fill = Target)) +
  geom_boxplot(outlier.shape=NA) +
  scale_y_log10() +
  theme_few() +
  theme(legend.position="right") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("DUX4 targets vs not targets")

Finally, we will always end our Rmarkdown documents with the sessing information:

sessionInfo()